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Abstract. A Wilson prime is a prime p such that (p — 1)! = —1 
(mod p 2 ). We report on a search for Wilson primes up to 2 x 10 13 , 
and describe several new algorithms that were used in the search. In 
particular we give the first known algorithm that computes (p — 1)! 
(mod p 2 ) in average polynomial time per prime. 



1. Introduction 
Wilson's theorem in elementary number theory states that 

(p — 1)! = —1 (mod p) 
for any prime p. The corresponding Wilson quotient is 

p 

and we define w p to be its residue modulo p in the interval — p/2 < w p < p/2. 
A Wilson prime is a prime such that w p = 0, or equivalently 

(jp — 1)! = — 1 (mod p 2 ). 

Only three Wilson primes are known: 5, 13 and 563. 

All previously published searches for Wilson primes have used algorithms 
for computing w p whose time complexity is essentially linear in p. (In this 
paper, unless otherwise specified, time complexity means number of steps 
on a multitape Turing machine, see [25].) Since the input size is G(logp), 
these algorithms should be regarded as having exponential time complexity. 
For example, the simplest possible algorithm is to multiply successively by 
the integers 2,3, ... ,p — 1, reducing modulo p 2 after each multiplication. 
The best known algorithm for computing w p has complexity O(p ' 5+o ^) 
(see below), but this is still exponential in log p. 

The main theoretical contribution of this paper is an algorithm that com- 
putes w p in polynomial time on average: 

Theorem 1. The Wilson quotients w p for 2 < p < N may be computed in 
time 0(N log 3+o(1) N). 
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Let tt(x) denote the number of primes p < x. By the prime number 
theorem, n(x) ~ x/logx, so Theorem [T] implies that we can compute each 
w p in time essentially log 4 p on average. While this result does not improve 
the complexity for computing a single w p , it is of course directly relevant to 
the problem of searching for Wilson primes. 

The key idea of the new algorithm is to exploit redundancies among the 
products (p — 1)1 for varying p. For example, the Wilson quotients for 
N < p < 2N in some sense all incorporate the product Nl. Instead of 
computing N\ (mod p 2 ) separately for each p, we will compute it modulo 
the product Y\n<p<2N P 2 ■ A remainder tree then yields ./V! (mod p 2 ) for 
each p. Using FFT methods for integer arithmetic, this can all be achieved 
in average polynomial time per prime. Applying this idea recursively leads 
to an algorithm for computing the desired residues (p — 1)! (modp 2 ). A 
detailed description is given in the proof of Theorem Q] in Section [2j 

However, the space requirements of this algorithm render it impractical for 
large N, and we must implement a time-space tradeoff to obtain a practical 
algorithm: 

Theorem 2. Let M < N. The Wilson quotients w p for M < p < N may 
be computed in time 

0{M log 2+o(1) M + (N-M + VN) log 3+o(1) N) 

and space 

0(A(M, N) + VNlog 1+o{1) N), 

where 

A(M, N) = max(iV - M, (n(N) - vr(M)) log N). 

The algorithm implementing Theorem [2] consists of two main phases 
that we call Stage 1 and Stage 2. Stage 1 involves computing M\ mod- 
mo rTM<p<Af^ 2 ' an d contributes the 0(M log 2+ °^^ M) term to the time 
bound. Stage 2, which contributes the second term, is a modification of the 
algorithm implementing Theorem [H 

To understand the space bound, note that (tt(N) — 7r(M))log N is (up 
to a constant) a bound for the number of bits in YIm<p<n P- ^ n ^ ac ^' as 
long as the interval width N — M is not too small relative to N, we expect 
(tt(N) — 7r(M)) log N to be well approximated by N — M. We cannot prove 
this due to our ignorance concerning the distribution of primes; the best we 
can say unconditionally is that A(M, N) = 0((N — M)logN). However, 
we do have the following: 

Lemma 3. Suppose that N — M > ^/Nlog 2 N. Assuming the Riemann 
Hypothesis, we have 

A(M,N) =0(N -M). 
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Table 1. Primes 10 6 < p < 2 x 10 13 for which \w p \ < 10 



Proof. The Riemann Hypothesis implies that ir(x) = Li(x) + 0(^/xlogx) 
|33j . If M < y/~N the result clearly holds. Now suppose M > y/N] then 

P N 1 7V 7" 

(n(N) -ir(M))\ogN = / ^—dt + 0(Vn log 2 N) 

Jm log* 



< 2(JV - M) + 0(\/iVlog 2 N) 

= 0(N-M). □ 

In practice this means that the space usage behaves essentially like N—M, 
up to some constant that depends on the precise details of the implementa- 
tion, provided that N — M is somewhat larger than v^V- 

From these remarks we may draw the following practical conclusions. 
The average time per prime in Stage 2 is essentially log 4 p, the same as for 
Theorem [TJ However in Stage 1 the average time per prime behaves like 

P i 3 
■ log p. 



N-M 

This is no longer polynomial in logp, and represents the price we pay for 
restricting the space consumption. If we now assume that the amount of 
RAM is fixed, then a reasonable strategy to compute w p for all p up to some 
bound iVo is to apply Theorem [2] to successive intervals M < p < N, where 
N < N , and where N — M is chosen as large as possible given the available 
RAM. 

This is in fact what we did, for all p < 2 x 10 13 . We found no new Wilson 
primes up to this bound. Altogether this consumed over 1.1 million hours 
of CPU time. It is traditional, though meaningless, to give tables of 'near 
misses'. Table [T] shows the smallest \w p \ that we found, and Table [2] shows 
the smallest residues when ordered by \w p /p\. 

Retaining all of the residues would have required archival storage in 
the terabyte range. Instead, we only recorded those residues for which 
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Table 2. Primes p < 2 x 10 13 for which \w p /p\ < 1.5 x 10 



\w p \ < p/50000, i.e. approximately 0.004% of the primes examined. There 
are 27 039 026 such primes; the residues may be downloaded from the third 
author's web page (247 MB). 

The search for Wilson primes has an interesting history. The case p = 5 
is trivial, and p = 13 was noticed at least as early as 1892 [241 p. 318]. In 
1913, Beeger used the congruence 

p — 1 

w p = Bp_ 1 (modp), 

P 

where is the fc-th Bernoulli number, together with a published table of 
Bernoulli numbers, to check that there are no other Wilson primes less than 
114 [2]. Several years later he proved the congruence 

(1) ( p -l)! = (-l)(P-l)/2^^1^y (2 P-l) (modp 2 ), 

which reduces computation of w p to that of ((p — l)/2)! (mod p 2 ). He used 
this identity, together with a direct computation of the relevant factorials, 
to produce a table of w p for p < 300 |3j . We do not know when (pQ) was first 
discovered, but it appears (without proof) in [24 1. 

Lehmer later used Beeger's original method together with a newly ex- 
tended table of Bernoulli numbers to compute w p for p < 211 [21] . In a com- 
panion article, she mentions that Beeger communicated that his earlier table 
contains four errors, namely for p = 127, 167, 173 and 241 [22]. Lehmer's 
table is correct, but there is an additional unnoticed error in Beeger's table, 
for p = 239. The errors are rather clustered together, and one speculates 
on the human factors (computational exhaustion?) that may have been re- 
sponsible. For the modern reader, it is very easy to forget just how much 
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effort is required to generate such a table by hand. We invite the reader to 
spend a few minutes verifying that p = 13 is indeed a Wilson prime! 

After these early attempts, the search entered the computer age with the 
work of Goldberg, who used the Bureau of Standards Eastern Automatic 
Computer (SEAC), one of the first stored-program electronic computers, 
to test all p < 10 000 [15]. In this interval, not far beyond the previous 
search bound, was found the third Wilson prime p = 563. Froberg pushed 
this further to 30 000 and then 50 000 [TH [13]. In [13] he also discusses 
a heuristic concerning the distribution of Wilson primes. Namely, if one 
assumes that w p is uniformly distributed modulo p, then the probability 
that p is a Wilson prime is 1 /p, and the expected number of Wilson primes 
less than X is 

- = loglogA + c + o(l), 

p<X p 

where c = 0.2615... is Mertens' constant. This suggests that there should be 
infinitely many Wilson primes, but that they should be very rare. 

The search bound was successively increased to 200183 by Pearson [26J, 
1017000 by Kloss [H], 3 000 000 by Keller (see [271 P- 350]), 4000 000 by 
Dubner [IT], 10 000 000 and then 18 876041 by Gonter and Kundert [19]. 
(The computation was halted at 18 876 041 due to a power failure — see [271 
p. 350]. Many authors have cited an unpublished manuscript "All prime 
numbers up to 18,876,041 have been tested without finding a new Wilson 
prime" by Gonter and Kundert, but we have been unable to locate a copy.) 

None of these authors give many details on how they performed the com- 
putation. It seems likely that they were all aware of (pQ), and that they 
computed ((p — l)/2)! (mod p 2 ) by simply multiplying successively by 2, 3, 
. . . , (p — l)/2, reducing modulo p 2 at frequent intervals. 

Significant algorithmic progress on the problem was made by Crandall, 
Dilcher and Pomerance, who searched up to 5 x 10 8 [9]. They introduced two 
new main ideas. The first is that for many p, there exist identities better 
than ([TJ). For example, if p = 1 (mod 4), write p = a 2 + b 2 with a = 1 
(mod 4). Then we have the remarkable identity (proved in [7J) 

kiP- l )\ (, , ( p 

111 1 ' 2a (mod p 



\{p-l)) V 2 J\ 2a 

Together with (UJ) this reduces the computation of w p to that of ((p — l)/4)! 
(mod p 2 ). Similar identities are used in [9] to reduce to computation of 
((p — l)/6)! (mod p 2 ) in the case that p = 1 (mod 6). 

We extend this technique considerably in Section [3l showing how to reduce 
to computation of {{p — l)/e)! (mod p 2 ) for essentially any 'small' divisor e 
of p — 1. 

Second, [9j introduced a scheme that replaces most of the modular mul- 
tiplications by modular additions. Indeed they show how to compute N\ 
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(mod p 2 ) using iV + 0(N 2 ^ 3 ) additions and only 0(N 2 / 3 ) multiplications. 
This optimisation does not play a role in the present work. 

Crandall-Dilcher-Pomerance also mention an algorithm, essentially due 
to Strassen, that computes (p — 1)! (mod p 2 ) in time 0(p°- 5 +°( 1 )); however 
they found it was not competitive with their quasi-linear time algorithm 
over the range of their search. This can be improved by a factor of \ogp [6], 
yielding the best known algorithm for computing a single w p . 

Following this work, Carlisle-Crandall-Rodenkirch extended the search 
to 10 9 in 2006 (see [22 p. 241]) and then 6 x 10 9 in 2008 (personal com- 
munication). This work has not been published; we sketch their algorithm 
here. The basic idea is to explicitly compute the exponents appearing in the 
prime factorisation iV! = p^ 1 ■ ■ -p e T r ', and then compute this product, term 
by term, modulo p 2 . The complexity is 0(N/ log N) multiplications, which 
improves on the algorithms used in [9] by a factor of log N. 



2. Computing Wilson quotients in average polynomial time 

In this section we give algorithms that prove Theorems [T] and [2j The 
algorithms depend on three fundamental operations: integer multiplication, 
integer division, and enumeration of primes. We discuss the complexity 
of these operations first. We will give only a high level description of all 
algorithms, allowing the industrious reader to supply their own details con- 
cerning data layout and access patterns by the Turing machine. 

If X and Y are integers with at most N bits, their product can be com- 
puted in time 0(iVlog 1+o(1) N) and space O(N) using FFT methods [29lfT4"]. 
For division with remainder, we want Q = [X/Y\ (assuming Y > 0) and 
R = X mod Y. These can be computed in time 0(N log 1+ °^ N) and space 
0(N) |5]. 

Consider the problem of enumerating the primes M < p < N . In our im- 
plementation (see section H]) we used a simple sieve of Eratosthenes, i.e. after 
precomputing a table of primes q < y/N, we initialise a bit-array of length 
N — M and strike out multiples of each q to eliminate the composites. As- 
suming a RAM model with unit time access to arbitrary array elements, 
and in which integers of size 0(log N) can be manipulated in unit time, the 
complexity is at most 



£ 

q prime 



N - M 



- Yl ( — q M +1 ) =0((iV-M)loglogiV + V^V) 



by Mertens' theorem. 

While this simple algorithm is perfectly adequate in practice, in the Turing 
model the analysis is incorrect, because of the unavailability of unit-time 
array access. For completeness, Proposition [5] below gives a bound for the 
Turing model, following the approach suggested in [30, p. 226]. This result is 
not optimal, but suffices for our purposes. The key tool is merge sort, which 
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can be implemented efficiently on a Turing machine; see [TO] for a discussion 
of this, and for further applications of this observation in computational 
number theory. 

Proposition 4. The primes p < N may be enumerated in time 

0(N log 2 N log log N) 

and space 

0(N log iVloglogiV). 

Proof. First enumerate the primes q < y/N by trial division. There are 
0(y/N) candidates, and each requires 0{N 1 / i ) divisibility tests, so the time 
cost is 0(N 3 / 4+0 ^). 

Now for each q < y/N, generate the multiples of q bounded by N. The 
number of such multiples is d = ^2 q< ^\_N/q\ = O(NloglogN). Each suc- 
cessive multiple is computed via a single addition of integers of size 0(log N), 
so the time and space required to construct the list is 0(dlog N). Sort the 
list using merge sort; this costs time 0(d log d log N) = 0(N log 2 iV log log N) 
and space 0(dlog N) = 0(N log N log log N). The complement of the re- 
sulting list in 1 < x < N is the desired set of primes, and can be computed 
in one more pass in time 0(dlogN). □ 

Proposition 5. The primes M < p < N may be enumerated in time 

0((N- M + y/W) log 2 N log log N) 

and space 

0{{N -M + VN) log N log log N) 

Proof. First enumerate the primes q < \/N using Proposition [H This re- 
quires time 0(y/N log 2 iV log log N) and space 0(y/N log iVToglog N). 

Now for each q < y/~N, generate the multiples of q in the interval M < 
x < N. Determining the first multiple of each q, namely q\(M + l)/q~\, 
costs 0(log 2 N) per prime (assuming naive arithmetic), so 0(y/N log 2 N) 
altogether. The number of such multiples is 

d < y, tt N - M )M < E - M )/« + 1 

q<VW q<VN 

= 0((N - M) log log N + y/N) = 0((N — M + y/N) log log N). 
The proof is concluded in the same way as Proposition HJ □ 

Having dealt with these preliminaries, we now turn to computing Wilson 
quotients. First we give a simple algorithm that proves Theorem [Tj and 
which will serve as a template for the more involved algorithm needed for 
the proof of Theorem [2j 
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Proof of TheoremUi First use Proposition E] to enumerate the primes p < N 
in time 0{N log 2+o(1) N). 

Let d = flog 2 N] . For each < i < d and < j < T let 

{ N N 

U h] = \k^Z:j-<k<(j + l)- 

Thus Uifi, . . . , c7j 2 i -i partition the interval < k < N into 2* sets of roughly 
equal size. For < i < d we have the disjoint union U%j = t/i+i^jUC/i+i^j+i, 
and = or 1 for every j. 

For each i, j let 

= n «u = n p 2 - 

p prime 

Note that A^j = Aj+x, 2j^i+i,2j+i, an d that Aij has 0(2~ l NlogN) bits. 
We have = 1 or k according to whether Udj = or {k}. We may 
compute all the Aij using a product tree [5], working from the bottom of 
the tree (i = d) to the top (i = 0). The cost at each level of the tree 
is 0(2 i (2- i NlogN)\og l+ °^ N) = 0{N log 2+ °^ N), so the total cost to 
compute all the Aij is 0(N log 3+0 ^^ N). Similarly we may compute all 
the Sij using a product tree and the precomputed table of primes, in time 
0(N log 3+o(1) N). (In fact, because of the estimate T, P <N l °gP = °( N )> 
this product tree takes time only 0(N log 2+ °^ N), but we will not use this 
here.) 
Now let 



N 



(mod Si t 



Wij = |~[ A i)T (mod Sij) = 

0<r<j 

We may compute all the Wij in time 0(A r log 3+0 ^ 1 ^ N) by working from the 
top of the tree to the bottom, starting with Wo,o = 1 an d then using the 
relations 

(2) W Wj = Wi,j (mod S Wj ), 

(3) W i+1;2 j+i = W it jAi + i i2 j (mod S i+lt 2j+i)- 

Finally we may read the Wilson quotients off the bottom layer of the Wij 
tree: for each p < N, let j = \2 d p/N] — 1. Then Uaj = {p}, so Sjj = p 2 
and Wdj = {p— 1)! (mod p 2 ). □ 

Now we consider Theorem [2j The first step (Stage 1) is to evaluate M! 
(mod S) where S = Y\m<p<nP 2 - Using a full product tree for Ml would 
lead to time complexity 0(M log 3+o(1) M), since logM! = 0(M logM). In 
the next proposition, we reduce this to 0{M\og 2+0 ^ M) by using a space- 
optimised variant of the factorial algorithm of [30] . In practice Stage 1 makes 
a significant contribution to the total running time, so the reduction in time 
by a factor of logM is significant. 
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Proposition 6. Let S > be an integer with at most B bits. Then N\ 
(mod S) may be computed in time 

0(N log 2+ °W N) 

and space 

O (B + y/N log N log log N ) . 

Proof. Let N\ = p e ^ ■ ■ -p e r r be the prime factorisation of Nl. For each j we 
have 

(4) e 3 = [N/pj\ + [N/ P )\ +■■■+ lN/pf°* N ^] < * 

P j i 

Let d = \log 2 {N + l)] , so that N < 2 d , and for each 1 < j < r let 

e i = /o,j + 2/ij H 1" 2 d " 1 /d-i,j 

be the binary representation of ej, i.e. with foj = or 1. Then 

(5) iV! = Ao(yl 1 ) 2 (A 2 ) 4 ---(A d _ 1 ) 2[i - 1 , 
where 

A i =p{ iA ---p f r i - r . 

Observe that if pj — 1 > 2~*iV then e, < 2* by ©, so /y = 0. Thus actually 

A= II P? J > 

and we have the following estimate for the size of Ai\ 
logAi< logp = 0(2- i N). 

p<2-' ; Af+l 

We will first show how to compute A{ (mod S) in time 
0((2- l N + VN) log 2+o(1) N) 



and space 0(B + V N log N log log iV) . 

Partition the interval 1 < k < 2~ l N + 1 into subintervals, say T±, . . . ,T m , 
where each subinterval, except possibly the last, has length 

f r-r B 
max v N. 



log log log N / 

For each subinterval T r , perform the following operations. 

First use Proposition [5] to enumerate the primes in T r . For each subinter- 
val, this uses space 0{(L+y/N) log iV log log N) = 0(B+^/N log NloglogN). 
The time cost for each subinterval of length L is 0((L + y/N) log 2+ °^ N) = 
0{L\og 2+0 ^ N). There are at most 2~ l N/L such subintervals, so their total 
cost is 0(2- { N log 2+o(1) N). The last interval has length at most 2~*N, so 
contributes O^^N+y/N) log 2+o(1) N). The time cost over all subintervals 
is therefore 0((2~W + y/N) log 2+o(1) N). 
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Now compute for each pj G T r . Using (Jl|), the time complexity is 
0((log N/logp) log 1+o(1) N) = 0(log 2+0 W N) for each prime, which over all 
subintervals is 0(^(2"* AT) log 2+o(1) N) = 0(2- i Nlog 2+o( - 1) N). 

Append the primes for which fij = 1 to a separate buffer. Whenever the 
total length of that buffer reaches B (i.e. when it contains B / log A" primes), 
or when we finish processing the last interval, use a product tree to multiply 
together the primes in the buffer (using space 0(B)), and then clear the 
buffer to receive more primes. Accumulate the result of the product tree 
into a running product for Ai (mod S), using a single multiplication mod- 
ulo S (again space usage is 0(B)). The total time for the product trees over 
all intervals is 0((log A) log 2+o(1) B) = 0(2~ i N\og 2+ °^ N), since we may 
certainly assume that B = 0(log N\) = 0(N log N). The time for the mod- 
ular multiplications is 0(L(logAj)/B|Blog 1+o(1) B) = 0(2^ N log 1+o(1) N). 
We conclude that Ai (mod S) may be computed within the promised time 
and space bounds. 

Now let 

C i = A i (A i+1 ) 2 ---(A d ^ 1 f- 1 - i 

for < i < d - 1. We have C d - X = (mod S) and d = Ai(C i+1 ) 2 

(mod S) for < i < d — 2. Using these relations, we compute in sequence 
A d _i,C d -i, A d ^ 2 ,C d -2, ■ ■ ■ ,A Q ,C (rnodS*). By ([5]), at the end we have 
obtained Cq = N\ (mod S). To estimate the time complexity, note that 

logCi = 0(2- l N + 2(2~ i - l N) + ■■■ + 2 d - 1 - i (2- d+1 N)) 
= O^^AriogAO. 

Therefore computing C{ = Ai(Ci + \) 2 (mod S) from Ai (mod S) and Ci + \ 
(mod S) has time complexity 0(2"Wlog 2+o(1) N). (Here we have used the 
fact that if X and Y are integers with at most M bits, then XY (mod S) can 
be computed from X (mod S) and Y (mod S) in time 0(Mlog 1+0 ^ M). 
Indeed if XY < S then no modular reduction is performed, whereas if 
XY > S, we need to perform one modular reduction whose time cost is 
bounded by a constant multiple of the cost of the full multiplication.) The 
space complexity is 0(B), with the previous values of Ai and Cj discarded 
as we proceed. Summing over i, the total time cost is 0(Anog 2+ °^ N). □ 

Finally we may prove Theorem [2l 

Proof of Theorem® Let L = N — M . We must first enumerate the primes 
M < p < N. Using Proposition [5] directly for this would use too much 
space, but we may instead apply it to successive subintervals of length K = 
[L I log log log A^J . (We tacitly assume that L > log N log log N; otherwise 
the theorem is trivial.) The space used is 0((K + y/N) log A r log log N) = 
0(L+y/N log N log log AQ, plus the space needed to store the primes, namely 
0((ir(N) - 7r(M))log N). The total is 0(A(M,N) + \/N log A^ log log N) . 
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The time over all subintervals is 0(L log 2+o(1) A + (L/K)VN log 2+ ° w A) = 
0(Llog 2 +°W TV + VAlog 3+ °« TV). 

Multiply the primes together using a product tree to obtain S = Soft = 
Wm<p<n P 2 ■ The number of bits in S is B = 0(A(M, A)), so this takes 
space 0{A(M,N)) and time 0(A(M, N) log 2+o(1) A) = 0(L log 3+o(1) A). 
(This last time bound is unduly pessimistic, but this does not matter as it 
is dominated by other steps later.) 

Use Proposition to compute Ml (mod S) in time 0(M log 2+ °^ M) and 
space 0(A(M,N) + \/N log A log log A) . This is Stage 1. 

For Stage 2, we use a similar strategy as in the proof of Theorem [TJ but 
taking additional care to economise on space usage. Let d = [log 2 L] . For 
each < % < d and < j < 2* let 

U itj = {keZ:M + j^<k<M+ (j + 1)| 



For each i this yields a partition of the interval M < k < N into 2 l sets. As 
in Theorem [H put 

= n fc > = n p 2 - 

p prime 

The definition of W% j is slightly different; we take 



(mod S 1 -, 



Wij=M! J] Ai, r (mod 5^) = 

0<r<j 

We do not have enough space to store all of the A^j and S^j, so we must 
proceed differently to the proof of Theorem [TJ We will use a strategy similar 
to the proof of [344 Lemma 2.1]. 

We begin at the top of the tree with Wo,o = M! (mod So,o), which was 
computed above using Proposition As in the proof of Theorem [TJ we 
use relations ([2]) and (|3|) to work our way down the tree. Every new pair 
of values Wj+i^j and Wi+i 2j+i overwrites the previous value of Wij. For 
fixed i, the total size of the Wij at level % is 0(A(M, N)), so the space for 
storing the Wij never exceeds 0(A(M, N)). 

For the top £ = [2 log 2 log A^J levels of the tree, we recompute each re- 
quired Aij and Sij as we encounter them, discarding intermediate values 
(i.e. Aij and Sy from lower levels of the product tree) as we proceed. Also, 
in the evaluation of ([3]), we do not compute Aj+x,2j+i exactly, but rather 
only modulo Sj+i^'+i, by reducing as appropriate during the product tree 
computation. The time complexity contributed by each level of the tree is 
thus 0(L log 3+o(1) N) (this is a factor of log N more than in Theorem [lj 
due to the recomputations) , but over the first I levels this amounts to only 
0(Llog 3+o(1) A log log N) = 0(Llog 3+o(1) A). 

When we reach level £, we switch back to the strategy of Theorem [TJ For 
each j at level £, we compute the entire trees beneath Agj and Sij. This 
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requires space 0(tog(^-) log N) = 0(2~ e Llog 2 N) = 0{L) = 0(A(M,N)). 
The time contribution from each level is 0{L log 2+ °^^ N), so over all levels 
is 0{L log 3+ °*- 1 ^ N). The Wilson quotients are extracted from the Wdj just 
as in Theorem [TJ □ 



3. Factorial identities modulo p 2 

Let e be an even divisor of p — 1, and let / = (p—1) /e. In this section we 
describe a method for reducing computation of (mod p 2 ) to that of 

/! (mod p 2 ). 

As mentioned in the introduction, identity (pQ), corresponding to the case 
e = 2, has been applied to the computation of Wilson quotients for almost 
a century. The cases e = 4 and e = 6 were introduced by [9] . 

Our method can be applied in principle to any e. The simplest case, and 
the only case we will describe in this paper, is when the e-th cyclotomic field 
over Q has class number 1. It is known that this occurs for precisely the 
following values of e ([351 Ch. 11]): 

2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 

32, 34, 36, 38, 40, 42, 44, 48, 50, 54, 60, 66, 70, 84, 90, 

and these are the values of e that we used in our implementation. 

It is straightforward to modify the algorithms given in the proof of The- 
orem [2] to compute fl (mod p 2 ) instead of (p— 1)! (mod p 2 ). For example, 
given a set T of primes p lying in the interval M < p < N and satisfying 
p = 1 (mod e), the modified Stage 1 involves using Proposition[6]to compute 
LAf/eJ! (mod \[ p(iT p 2 ). 

To apply this to the main Wilson prime search, each prime p is assigned to 
the 'best' possible e for that prime, i.e. the largest divisor of p — 1 appearing 
in the above list. Then for each e, our strategy is to use the (suitably 
modified) algorithm of Theorem [2] to compute w p for all p assigned to e. 

It is a difficult theoretical problem to analyse the savings that accrue from 
this strategy. If we assume that the amount of RAM is fixed, then Stage 1 
will dominate for sufficiently large N. In Stage 1 we expect a speedup by 
roughly a linear factor of e, since we are only computing [M/e\ ! rather than 
Ml. Therefore, in the limit of large N, we expect a savings of a factor of e 
for the primes assigned to e. 

In practice however these ideal conditions are not met. Stage 2 does make 
a significant contribution, especially for larger values of e. The effect of e on 
Stage 2 is complex. As e increases, a fixed interval M < p < N will contain 
fewer and fewer primes of interest. The number of such primes depends in 
a complicated way on the complete list of admissible e. To make best use 
of available RAM, for larger e we will generally choose a larger interval, 
so that the number of primes in the interval is roughly constant, but the 
relationship is not linear. 
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In addition, we must take into account the cost of deducing (p — 1)1 
(mod p 2 ) from ((p — l)/e)! (mod p 2 ). We refer to this step of the computa- 
tion as Stage 3. We have not attempted to give a theoretical bound for the 
cost of Stage 3. In general it becomes more expensive as e increases. In our 
computation it accounted for only a few percent of the total running time 
(see Table 0). 

Let us estimate the overall savings, over many primes, under the assump- 
tion that the speedup is linear in e, and ignoring the cost of Stage 3. Let 
S be a set of permissible values of e, for example, the set {2,4,..., 84, 90} 
given above. We assume that for each e 6 S, we apply the above strategy 
to those primes p for which e is the largest divisor of p — 1 that appears in 
S. Let Q s = LCM(S). For k £ (Z/Q S Z)*, let b s (k) = max{e e S : k = 1 
(mod e)}. Then the expected savings is 

For example, if we only use identity (pQ), then S = {2}, Qs = 2, and 
Rs = 1/2, so we save a factor of 2 over the naive algorithm. 

The identities used in [9] correspond to choosing S = {2,4,6}, in which 
case Q s = 12 and R s = (1/6 + 1/4+ 1/6 + 1/2)/4 = 13/48, saving a further 
factor of 24/13 ps 1.85. 

Taking S to be the full set S = {2, 4, ... , 84, 90}, we have 

Qs = 6983776800 = 2 5 • 3 3 • 5 2 • 7 • 11 • 13 • 17 • 19. 

A brute force computation finds that 

22695187978681 ni „ 

Rs = ~ 0.112, 

201921527808000 

indicating a further savings of a factor of roughly 2.41 compared to [9]. 

Now we explain the reduction. Fix a primitive e-th root of unity uj G Z p . 
Let r p : Z p — > Z* denote the p-adic gamma function. The next proposition, 
whose proof is adapted from [3J Thm. 9.3.1], gives a congruence between 
(p — l)!//! e and a special value of the p-adic gamma function. 

Proposition 7. Let 



F i=i 



Then 



{ '' ~ ' *' - -r p (l/e) e (l + pC) (mod p 2 ). 



/! 

Proof. Let M = p 2 - (p 2 - l)/e = p 2 - /(p + 1). Then M = 1/e (mod p 2 ) 
and 1 < M < p 2 . By the definition and elementary properties of T p (x) (see 
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for example |20} Ch. 14]) we have 



r„(l/e) = r p (M) (modp 2 ) 



= - Yl j ( mod p )• 

i<i<M 

Splitting the product into blocks of length p we obtain 



/[M/pl-lp-i \ /\M/p\p~\ \ 1 

r P (i/e)=- n n^+ r ) n (mod/). 

\ fc=0 r=l / \ j=M / 

Since [M/p] =p- f + [f/p\ =p- f, 

/p-f-ip-i \ / P 2 -/p-i \ _1 

r P (i/ e )=- n n^+ r ) n > (mod/). 

For the first term, observe that for any k £ Z we have 



kh^lZZl = ff(l + Wr) = 1 + fcpV 1/r = 1 (mod 

(p-1)! - LJ r ^— ' 

v ' r=l r=l 

Therefore 

p-/-1p-i 

n n^ +r ) = (p~ i ) !P_/ ( m ° d p 2 )- 

fc=0 r=l 

For the second term, 

p 2 -/p-i -/p-1 / 

n j= n i=(- i ) / n( r ' + ^) (modp 2 ). 

j=p 2 ~fp-f j=-fp-f r=1 
To evaluate this last product, note that 

ULi(r + fp) = -q (1 + fp/r) = 1 + fp J2l/r (mod/). 
•'" r=l r=l 

Moreover, for any 1 < j < e — 1, 

(l-^)'-(l-^) = 1 g ^ ( _^ k 



p p^\k 



p ^ (p-l)(p-2)---(p-k + l) t jk 

h — *(*-d-i — ( ] 

_ ^ il_ LL w ifc = _ ^ oj jk /k (mod p). 



k\ 

k=l k=l 
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Thus 



Since 



we get 



e— 1 p— 1 p— 1 



c 



j=l k=l k=l j=l 

e-1 r 



2>*y = -i + 



if e | fc, 
otherwise, 



C 



Ei:- e E- = -E 1 ^ (modp). 



k ^—^ er 

k=l r=l r=l 



Putting everything together, we have 

-( p -iyp-f(-i)f 

T *W e > = _ fp C ) ( m °dp)- 

From Wilson's theorem we have (p — l)\ p = — 1 (mod p 2 ), and so 

Rearranging, we obtain the desired formula. □ 

Next we will use the Gross-Koblitz formula to relate r p (l/e) to a certain 
Gauss sum. Let K = Q(Ce)> where £ e is a primitive e-th root of unity. The 
ring of integers of K is Ok = Z[£ e ]. Let uj$ G Z be an integer congruent to 
co (mod p), and let P = (p, £ — cjq)- Then P is a prime ideal of of degree 
1 lying above p, i.e. Ok/P — F p . Let x : F* — > K* be the (-/)-th power 
of the Teichmiiller character; that is, xi u ) = u ~ (mod P) for any u £ F*. 
Define the Gauss sum 

p-i 
i=i 

where £ p is a primitive p-ih root of unity. 

Proposition 8. We have S{x) e £ K. Regarding K as embedded in Q p via 
the map that sends C, e to co, we have 

_ Vr[1/er = tML. 

Proof. The first statement follows from [20, Ch. 1, Thm. 1.3(i)]. The second 
statement is a consequence of the Gross-Koblitz formula, for example |20|. 
Ch. 15, Thm. 4.3]. In the notation of [20], take r = 1, q = p, a = p — 1 — /. 
The above formula falls out after taking e-th powers. □ 

The final ingredient is the Stickelberger factorisation of the ideal of K 
generated by S{x) e - For c 6 (Z/eZ)*, let a c denote the automorphism of 
K/Q that sends Ce to Q- 



16 



EDGAR COSTA, ROBERT GERBICZ, AND DAVID HARVEY 



Proposition 9. 

(sur)= n *c-i(py- 

c=l 
(c,e)=l 

Proof. Raise both sides of [201 Ch. 1, Thm. 2.2] to the power of e. □ 
Proposition 10. Suppose that P is principal, and let 9 be a generator. Let 

e-l 

p= n °c-i(o) c £o K . 

c=l 
(c,e)=l 

TTien 

S(x) e = C/3 

/or some < i < e. 

Proof. By Proposition [9J S'(x) 6 an d /3 differ by a unit of Ok- Moreover, 

c c 

so 

^_i(/3) =n^w e = % Q («) e = ^(p) e = P e . 

c 

Thus the image of j3 under every complex embedding K — > C has absolute 
value p e l 2 . But S(x) e has the same property [;20j p. 4]. Therefore S(x) e /f3 
has absolute value 1 in every complex embedding, and so is a root of unity 
in K [351 Lemma 1.6]. Since e is even, every root of unity is a power of C e , 
and the conclusion follows. □ 

Theorem 11. Let p = 1 (mod e), where e is even. Assume that K = Q(Ce) 
has class number 1. Assume we are given as input: 

• a primitive e-th root of unity in F*, represented as an integer 1 < 

L0 < p, 

• a generator 6 of the ideal P = (p, £ e — uio), represented as 9 = g(Ce) 
for some polynomial g £ Z[x] of degree less than <j){e), and 

• /! (mod p 2 ). 

Then we may compute (p — 1)! (mod p 2 ) using 0(e 2 + elogp) arithmetic 
operations on integers with O(logp) bits. 

The big-O estimates given in the above theorem are strictly speaking 
meaningless, since they only apply to finitely many e. We give the esti- 
mates anyway as an indication of how the running time might reasonably 
be expected to behave in practice. 

We may compute a suitable ojq using a simple probabilistic algorithm as 
follows. Select a random 1 < x < p — 1. Then ujq = x-* (mod p) has order 
exactly e with probability (p(e)/e > 1/e. We can compute the exact order 
using at most e arithmetic operations in Z/pZ. This is repeated until we 
find a suitable ojq. 
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The computational bottleneck is actually in finding 8, which we consider 
after the proof of the theorem. 

Proof. We will take 'arithmetic operation' to mean an addition or multipli- 
cation modulo p, p 2 or p 3 . 

We first lift the root of unity, putting oj\ = 0Jq (mod p 3 ), so that oj\ = oj 
(modp 3 ). This requires O(logp) arithmetic operations. We next compute 
the powers oj\ for < i < e, using 0(e) arithmetic operations. Computing 
C from Proposition [7] requires O(elogp) arithmetic operations. 

Let 7 be the image in Z p of f3/p, where f3 is as in Proposition 1101 i- e - 



e-l 

l 

7 



- n 

P 0=1 

(c,e)=l 



With this formula, we may compute 7 (mod p 2 ) using 0(e 2 ) arithmetic 
operations. Combining Propositions El El [9] and [lOj, we have 

oj~\p- 1)! = (-/!) e 7(l +pC) (mod p 2 ) 

for some < i < e, so we can compute u~ l (p — 1)1 (mod p 2 ) using a further 
O(loge) operations. However, we know that (p — 1)1 = — 1 (mod p), so we 
can determine i by comparing with the tabulated powers of oj. □ 

Before discussing the computation of 9, we illustrate Theorem 1111 with a 
numerical example. Take p = 3333331, e = 18, / = 185185, and the 18th 
root ujq = 1819843. The Teichmiiller lift is 

oj = 1819843 + 1422487p + 90367p 2 (mod p 3 ), 

and 

c= (i-^-(i-. ) + ... + (i-^-(i-^) =418399 

p 

Using the cyclotomic GCD algorithm discussed below, we find a generator 
= g((e) of P = (p, Ce ~ w ) given by 

g(x) = -5x 5 - 10x 4 + 7x 3 + 3x 2 + Wx - 4. 

Then 

7 = - 5 (o;) 5 (a; 11 ) 5 5 (a; 13 ) 7 5 (a; 5 ) 11 5 ( W 7 ) 13 < 7 (a; 17 ) 17 
p 

= 1628187 + 503367p (mod p 2 ) . 
Now assuming that we have computed 

/! = 461190 + 275007p (modp 2 ), 

we find that 

w -*(p - 1)! = (_/!) e 7 (l + p C) = 1780730 + 2171988p (mod p 2 ). 
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Comparing with the powers of w, we find that loq = —1780730 (mod p), so 
i = 3 and 

(p-l)} = 3333330 + 27003p (modp 2 ). 

We conclude that w p = 27004. 

Now we consider the problem of computing 9. The standard approach to 
the ideal generator problem is based on lattice reduction (see for example 
[8]), and indeed there exist highly optimised implementations in software 
packages such as Pari/GP [32]. 

After some experimentation we settled on a different approach, which 
we found to be considerably faster than Pari in practice. Our algorithm is 
closer in spirit to the elementary Euclidean GCD algorithm. We emphasise 
that this is not a general-purpose algorithm for finding ideal generators: it 
assumes that K has class number 1, and also uses the fact that we know in 
advance that the generator is an irreducible element whose norm is not too 
small. In addition we are unable to prove that the 'algorithm' terminates. In 
practice we find that it does terminate quite quickly. Pseudocode is shown 
in Algorithm [T] below. The algorithm is applied to the inputs X = p and 
Y = Q e — loq, and their GCD is precisely the desired 6. 



Algorithm 1: Heuristic cyclotomic GCD 
Input: X,Y G K 

S = precomputed list of elements of Ok of small norm 
Output: A generator of (X,Y) 

while X / and Y / do 

if N(X) < N(Y) then swap X and Y 
Q <— an element of Ok near X/Y 
Z «- X -QY 

if N(Z) < N{Y) then X <- Z 
else 

U ^— randomly selected element of S 
]fU\Y then Y <- Y/U else X <- XU 
end 
end 

if X = then return Y else return X 



l 

2 
3 
4 
5 
6 
7 
8 
9 
10 
11 



Several aspects of the algorithm deserve further discussion. 

All elements of Ok appearing in the algorithm are represented exactly, 
as Z-linear combinations of the basis elements {1, ( e , . . . , Ce -1 }? where d = 
<j)(e) = [K : Q], i.e. as polynomials in Q e . We first attempt to run the algo- 
rithm with all coefficients represented by signed 64-bit integers, and ignoring 
all overflows. If the algorithm terminates, we can check the output by veri- 
fying that the proposed 6 divides both p and Q e — ujq. This usually succeeds. 
If it is incorrect, or if the algorithm runs for too long without terminating, 
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we restart it. If this fails several times, we switch to an implementation that 
uses an arbitrary precision representation for the coefficients. This elimi- 
nates the possibility of overflow, so that if the algorithm terminates, the 
output is guaranteed to be correct. Again, if it runs for too long, we restart 
it. In practice this always eventually succeeds. 

Exact multiplication of elements of Ok (lines |4] and [8]) is achieved by 
naive polynomial multiplication followed by reduction modulo the cyclo- 
tomic polynomial (f> e (x). Exact division (line [8]) is achieved by the formula 
X/Y = X a(Y)/N(Y), where the denominator N(Y) = H a a(Y) is a 
rational integer. Here a denotes an automorphism of K, which is evaluated 
by cyclic permutation of coordinates followed by reduction modulo <j) e {x). 

Let 7~i, rf, . . . , T d/ / 2 > be the complex embeddings K C, and let r = 
(ti, . . . ,T d / 2 ) : K — > C d//2 be the corresponding vector of embeddings. For 
each variable V in Algorithm [TJ we also maintain a second representation, 
namely a double-precision floating point approximation to t(V). 

In lines [2] and [5j the norms are approximated by multiplying together the 
coordinates of t(V), rather than by computing an exact norm in Z. 

In line El we first approximate t(X/Y) by computing Ti(X)/ri(Y) (as a 
floating-point complex number) for each i. Applying the inverse of t yields 
an approximation to X/Y in K ®q R. We select Q by simply rounding 
each coordinate to the nearest integer. In the ideal situation we will have 
N(X/Y — Q) < 1. If this holds, then line [5] will succeed in updating X, 
and then we have made some progress in reducing the norm. However there 
is no guarantee that N(X/Y — Q) < 1 will occur. One possibility is that 
there exists some Q' £ Ok such that N(X/Y — Q') < 1, but that our simple- 
minded method for selecting Q did not locate it. To mitigate against this, we 
make a few attempts to adjust the coordinates of Q to locate a suitable Q' . 
This may still fail, and moreover it may turn out that there does not exist 
any Q' with the right property. This may occur if K is not Euclidean with 
respect to the norm; for example it is known that Q(C32) has this property 
[23j . In this case, we will fall through to lines [7HH1 

The goal of lines [THE! is to make some random perturbation, in the hope 
that we will be lucky in finding a good Q on the next iteration. In our 
implementation, we take S to be the set of elements of Ok of norm q, where 
q is the smallest prime q = 1 (mod e) (i.e. take all the conjugates of a 
generator of any prime ideal dividing qOK)- If we are lucky enough that U 
divides Y, then we know U cannot divide X, since we have assumed that 
the GCD has norm p, which is much larger than q. Thus dividing Y by 
U does not change the GCD. Otherwise, we simply multiply X by U and 
continue. This cannot change the GCD for the same reason. 

The rationale for this perturbation strategy is as follows. If X/Y is suf- 
ficiently close to an integer, then our method for selecting Q should find it. 
Otherwise, UX/Y is likely to be 'randomly distributed' modulo the integer 
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lattice, and there is a reasonable chance that it will be close to an inte- 
ger. We have not attempted to formulate this argument precisely or prove 
anything about it. 

Finally we discuss the issue of units. Whenever we compute a new element 
of Ok, say X, we examine the size of its coefficients, and compare this to 
N(X). If the coefficients are too large, we apply a balancing procedure, 
replacing X by u X for a suitable unit u S 0* K . This of course does not 
alter the GCD. An extreme example of an 'unbalanced' element is a high 
power of a nontrivial unit u £ K , which has large coefficients but norm 1. 
Without this balancing step, we soon encounter coefficient explosion (and 
overflow) . 

The condition we used to test for unbalancedness in our implementation is 
as follows: if X = cq + ciQ + • • • + we declare that X is unbalanced 

if \ YliZo \ c i\ > 10 1 N{X)\ l / d . There is no particular theoretical justification 
for this particular measure of size, nor of the choice of constant 10. We used 
it because it is fast to evaluate and seems to give good results in practice. 

To balance an element X we proceed as follows. (This strategy is in- 
spired by the definition of 'unbalanced' in [36].) Consider the logarithmic 
embedding L : Ok \ {0} ->■ R d / 2 defined by a H> (log |rj(a)|)v By Dirich- 
let's unit theorem, the image of the unit group 0* K under this map is a 
lattice of full rank in the hyperplane to + • • • + t&ii-x = 0. The vector 
(log|rj(X)| — \ log |i\T(X)|)j lies in this hyperplane. Armed with a pre- 
computed list of generators of 0* K (obtained for example via Pari), we may 
therefore use simple linear algebra over R to select a unit u so that log |Ti(it)| 
is close to log |rj(X)| — -j log |iV(X)| for all i. Then we replace X by u~ l X 
and continue. The rationale is that our choice of u ensures that |rj(n _1 X)| 
is close to \N(X)\ 1 / d for all i, so that the coefficients of u~ 1 X will be reason- 
ably small (although they might not actually satisfy the test for balancedness 
mentioned in the previous paragraph). 

4. Implementation and hardware 

Our implementation is written in C, using OpenMP for parallelisation 
at the level of the individual compute node. We used the GMP library 
|16| for multiple-precision integer arithmetic, with the following important 
exception. 

For very large integer multiplications — for operands exceeding around 
10 7 bits, depending on the hardware — we switch to our own implementation 
based on number-theoretic transforms (NTTs). This proceeds by splitting 
the input into small chunks of perhaps several words each, converting the 
problem to that of multiplying polynomials in Z[x]. This is then achieved 
by reducing modulo several suitable 62-bit primes q, multiplying the poly- 
nomials using FFTs over Z/gZ, and reconstructing the product in Z[x] via 
the Chinese Remainder Theorem. The FFT arithmetic is optimised using 
techniques described in [IT] . To ensure the running time behaves smoothly 



A SEARCH FOR WILSON PRIMES 



21 



as a function of the input size, we allow the number of primes to vary be- 
tween 3 and 6, and we select a transform length of the form 2 fc 3^ where 
< t < 6; that is, we use mainly radix-2 transforms, but allow a few lay- 
ers of radix-3 transforms. We use a strategy similar to Bailey's trick [T] to 
improve memory locality. 

The main reason that we did not use GMP's large integer multiplication 
code is that GMP does not take advantage of multiple cores in a shared 
memory environment. In contrast, our implementation is parallelised using 
OpenMP. This is crucial, because in Stage 1, the average complexity per 
prime is inversely proportional to the amout of RAM available. To make 
effective use of n cores, it is not good enough to process n intervals separately 
using one core each, since each core will have only 1/n of the available RAM, 
and will run in effect at 1/n of the speed. We must actually parallelise 
within the integer arithmetic, to get all cores working cooperatively on a 
single interval. 

Furthermore, our integer multiplication code is optimised heavily in favour 
of conserving memory. Its performance varies across platforms, but in all 
cases is competitive with GMP. For example, on a node of Katana (see be- 
low), multiplying two 1-gigabyte integers took 178s using GMP, with peak 
memory usage 9.1GB. Our code performs the same multiplication in 121s 
using only 5.3GB; running on 8 cores it takes 20s (a 6-fold speedup), using 
the same memory. 

A natural extension of this idea, which we did not pursue, is to increase the 
effective RAM available by making use of the fast networks on modern HPC 
systems to treat several nodes as a single computational unit. Whether this 
yields any speedup in searching for Wilson primes is an interesting question 
for future research. 

We ran our implementation over a period of about four months on sev- 
eral clusters at New York University ("Cardiac", "Bowery", and "Union 
Square"), the University of New South Wales ("Katana" and "Tensor"), 
and the National Computational Infrastructure facility at the Australian 
National University ("Vayu"). Table [3] summarises the characteristics of 
the nodes on each cluster, and the total CPU time expended on each clus- 
ter. Table H] gives a breakdown of the total CPU time into Stage 1, Stage 2 
and Stage 3. 

In the previous section it was pointed out that Stage 1 should dominate 
the computation for sufficiently large p. The data in Table H] shows that 
we have not yet reached this region. A more detailed accounting shows this 
behaviour beginning to occur in some parts of the computation; for example, 
for e = 2, on the machines with 32GB RAM, we found that Stage 1 starts 
to dominate for p around 5 x 10 12 . The threshold increases with e and with 
the amount of RAM per node. 

We used a client-server strategy to distribute work among the clusters. 
A master script ran on a server at NYU. When a compute node is ready 
to begin work, it sends a request via HTTP to the server. The server is 
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Cluster 


Architecture 


RAM (GB) 


Core-hours 


Cardiac 


AMD Barcelona 


32 


465 000 




16 cores, 2.3GHz 






Bowery 


Intel Nehalem 


48/96/256 


263 000 




12 cores, 2.67-3.07GHz 






Union Square 


Intel Xeon 


16/32 


154 000 




8 cores, 2.33GHz 






Tensor 


Intel Xeon 


16/24 


145 000 




8 cores, 3.0GHz 






Katana 


Intel Xeon 


24/96/144 


123 000 




12 cores, 2.8-3.06GHz 






Vayu 


Intel Nehalem 


24 


13 000 




8 cores, 2.93GHz 







Table 3. Cluster data 



Core-hours 


Stage 1 


464 000 


Stage 2 


655 000 


Stage 3 


44000 



Table 4. Breakdown of CPU time 



responsible for choosing a value of e (as in Section [3]) and a range of primes 
M < p < N to assign to that node. This basic outline is complicated by 
the fact that the time needed to complete a single block was generally much 
longer than the running time permitted for a single job by each cluster's job 
scheduler. It was therefore necessary to serialise intermediate computations 
to disk at appropriate intervals, and reload them by another job later on. 
Load balancing was also complicated by varying cluster availability over the 
duration of the project. 

Any computation of this size is bound to run into hardware failures and 
other problems. We took several measures to validate our results. 

First, for each p we check that our proposed valued for (p — 1)! (mod p 2 ) 
satisfies (p — 1)1 = — 1 (mod p). Second, in the notation of the proof of 
Theorem II 1\ we check that (— /!) e 7 is an eth root of unity modulo p. This 
simultaneously provides a strong verification of the cyclotomic GCD com- 
putation and of the computation of /!, at least modulo p. 

Finally, we wrote a completely independent program to compute w p using 
the 0(p°- 5+ °( 1 )) algorithm of [6], together with identity (pQ) (but none of 
the results of Section [3]). The underlying polynomial arithmetic is handled 
by the NTL library [31J. We ran this implementation on the 27039 026 
saved residues and found complete agreement. This computation was run 
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on Katana and Tensor, together with a Condor cluster, utilising idle time 
on machines in the School of Mathematics and Statistics at UNSW; it took 
440 000 CPU hours altogether. 
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